fixes bug where multiple nodes fight over an eip #435
Conversation
Force-pushed from 10ee2a8 to e2c9b82
Force-pushed from e2c9b82 to 5699eed
- Introduces resource_ids to the EIP status
- Moves setting the attachment status to before the attempt at attaching
- Moves the status update from patch to replace; this should raise errors if the resource has changed, allowing the resource_id to function somewhat as a lock
- A resource won't attempt to claim an EIP that already has a resource_id
- Fixes cargo deny errors
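For illustration, here is a minimal sketch of how a replace-based status update can act as an optimistic lock. The `Eip`/`EipStatus` types, field names, and the `claim_eip` function below are illustrative stand-ins, not the operator's actual definitions; only the general kube-rs API calls are real.

```rust
use kube::api::{Api, PostParams};
use kube::CustomResource;
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

// Illustrative stand-in for the operator's EIP CRD; the real eip-operator
// types differ, this only demonstrates the claim flow.
#[derive(CustomResource, Clone, Debug, Deserialize, Serialize, JsonSchema)]
#[kube(
    group = "materialize.cloud",
    version = "v1",
    kind = "Eip",
    status = "EipStatus",
    namespaced
)]
pub struct EipSpec {
    pub selector: Option<std::collections::BTreeMap<String, String>>,
}

#[derive(Clone, Debug, Default, Deserialize, Serialize, JsonSchema)]
pub struct EipStatus {
    // Name of the node (or pod) that currently claims this EIP.
    pub resource_id: Option<String>,
}

async fn claim_eip(
    eips: &Api<Eip>,
    eip_name: &str,
    node_name: &str,
) -> Result<(), Box<dyn std::error::Error>> {
    let mut eip = eips.get(eip_name).await?;
    // Record the claimant *before* attempting the actual AWS attachment.
    let status = eip.status.get_or_insert_with(EipStatus::default);
    status.resource_id = Some(node_name.to_string());
    // A full replace carries the resourceVersion we just read, so if another
    // node modified the EIP in the meantime the API server rejects this with
    // 409 Conflict instead of letting two nodes silently claim the same EIP.
    eips.replace_status(eip_name, &PostParams::default(), serde_json::to_vec(&eip)?)
        .await?;
    Ok(())
}
```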
Force-pushed from 5699eed to 484a96e
eip_operator/src/controller/node.rs
Outdated
Err(err)
    if err
        .to_string()
        .contains("Operation cannot be fulfilled on eips.materialize.cloud") =>
What error type is this? If it is from Kubernetes or something we defined, we should be able to match a more specific type than just doing string matching.
I think this is an eip_operator::Error that comes from a kube::Error. I want to say I attempted to match an error, but this was just easier and iirc I might've needed a guard regardless.
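For reference, if the underlying error surfaces as a kube::Error::Api wrapping the API server's ErrorResponse, the conflict can be matched structurally rather than by substring. This is only a sketch and assumes the operator's error type lets you get at the inner kube::Error, which may not hold for the actual eip_operator::Error:

```rust
use kube::error::ErrorResponse;

// Sketch: detect a stale-resourceVersion conflict on the kube::Error itself
// instead of matching on the formatted message string.
fn is_conflict(err: &kube::Error) -> bool {
    matches!(
        err,
        // 409 Conflict is what the API server returns when the
        // resourceVersion on a replace is stale.
        kube::Error::Api(ErrorResponse { code: 409, .. })
    )
}
```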
eip_operator/src/controller/node.rs
Outdated
"Pod {} failed to claim eip {}, rescheduling to try another", | ||
name, eip_name | ||
); | ||
return Ok(Some(Action::requeue(Duration::from_secs(1)))); |
Why do we have this special handling at all? If the reconcile loop returns an error, it will already log an error and schedule a re-reconcile in the very near future; I don't quite understand what we gain from handling this manually.
If we're expecting this as a valid possibility due to a conflict and we're able to handle it appropriately, is it really an error? I can just return the error, but I think that makes the logs more confusing than an info message indicating this sort of thing was expected.
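A condensed sketch of the trade-off described above, assuming the error reaching this point is still a kube::Error (the real code wraps it in eip_operator::Error) and using kube_runtime's Action type: the conflict is treated as an expected, recoverable outcome and logged at info, while anything else still propagates as an error.

```rust
use std::time::Duration;

use kube::runtime::controller::Action;

// Sketch: treat an expected claim conflict as recoverable. Log at info and
// requeue shortly; anything else still propagates so the controller logs it
// and applies its usual error backoff.
fn handle_claim_result(
    res: Result<(), kube::Error>,
    name: &str,
    eip_name: &str,
) -> Result<Option<Action>, kube::Error> {
    match res {
        Ok(()) => Ok(None),
        // 409 Conflict means another node won the claim; that's expected.
        Err(kube::Error::Api(resp)) if resp.code == 409 => {
            tracing::info!(
                "Pod {} failed to claim eip {}, rescheduling to try another",
                name,
                eip_name
            );
            Ok(Some(Action::requeue(Duration::from_secs(1)))
        }
        Err(err) => Err(err),
    }
}
```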
Force-pushed from bd85c36 to 4f7efc7
Force-pushed from 1edfaac to 1c3cb13
Force-pushed from 1c3cb13 to b3bd1b0
This looks great! I have one remaining question and a few nitpicks, but all the logic looks correct to me.
    s.resource_id.is_none()
        || s.resource_id.as_ref().map(|r| r == &name).unwrap_or(false)
})
});
How do we handle the migration from older versions of the EIP operator, which don't have resource_ids? Is this OK just because we don't have multiple nodes that match currently?
Yeah, there could be some re-alignment during this migration if there are multiple nodes matching a single EIP, but it shouldn't be any more unstable than the existing implementation.
I left a bunch of nitpicks in my previous review comments, but won't block the PR on those.
Alternative to #348
I believe this still has issues.
The main issue with an approach like this: when a node/pod drops an EIP, we're left waiting on the pod/node reconciliation cycle to reassign it. The EIP will reconcile due to the status change, but without adding a considerable amount of code like was done in #348, EIPs won't attach themselves to a resource or cause resources to re-reconcile and set up the attachment to the EIPs.
^ Update to this:
I've now pushed a change where the EIP will use a dynamic resource provider to find all resources matching its selector and update a label with a timestamp, causing those resources to be reconciled. This seems to work really well. Let me know if there are any concerns.
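For illustration, a rough sketch of that label-touch mechanism, assuming an Api<Node> and a merge patch; the label key and the way matching nodes are selected here are made up for the example and won't match the operator's actual implementation.

```rust
use k8s_openapi::api::core::v1::Node;
use kube::api::{Api, ListParams, Patch, PatchParams};

// Sketch: when an EIP's claim is dropped, touch a label on every node that
// matches its selector so those nodes re-enter the reconcile queue, rather
// than waiting for their own periodic reconciliation.
async fn nudge_matching_nodes(
    nodes: &Api<Node>,
    label_selector: &str, // derived from the EIP's selector (illustrative)
) -> Result<(), kube::Error> {
    let lp = ListParams::default().labels(label_selector);
    let now = std::time::SystemTime::now()
        .duration_since(std::time::UNIX_EPOCH)
        .expect("clock before Unix epoch")
        .as_secs()
        .to_string();
    for node in nodes.list(&lp).await? {
        let name = node.metadata.name.clone().unwrap_or_default();
        // Merge-patch a timestamp label; the label change alone is enough for
        // the controller watching Nodes to re-reconcile the object.
        let patch = serde_json::json!({
            "metadata": { "labels": { "eip.materialize.cloud/last-nudge": now } }
        });
        nodes
            .patch(&name, &PatchParams::default(), &Patch::Merge(&patch))
            .await?;
    }
    Ok(())
}
```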